ST590 Project 3

Jovanni Catalan & Sergio Mora

Introduction

You should discuss the goals of the notebook, introduce your data set, and give the source for your data set

The goal of this notebook is to develop a clear understanding of obesity rates among individuals from Mexico, Peru, and Colombia, based on the multiple metrics collected. The data come from the UCI Machine Learning Repository, which obtained them from the article "Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico".

The data set has 17 columns and 2,111 observations. The columns are:

Gender: Patient's gender {object}

Age: Patient's age {float}

Height: Patient's height {float}

Weight: Patient's weight {float}

family_history_with_overweight: Whether the patient has a family history of overweight {object}

FAVC: Frequent consumption of high-caloric food {object}

FCVC: Frequent Consumption of Vegetables {float}

NCP: Number of Main Meals (how many meals the patient has daily) {float}

CAEC: Consumption of food between meals {object}

SMOKE: Does the patient smoke {object}

CH20: Consumption of water in liters {float}

SCC: Does the patient monitor the calories they consume {object}

FAF: How often does the patient have physical activity {float}

TUE: How often the patient uses technological devices (e.g. phones, video games, TVs, computers) {float}

CALC: Consumption of alcohol (how often the patient drinks alcohol) {object}

MTRANS: What type of transportation does the patient normally use {object}

NObeyesdad: Patient's weight status {object}

Obesity levels defined as:

Supervised Learning Idea and Data Split

Give a discussion of what we are generally trying to do with supervised learning where prediction is our goal. Discuss why we want to split our data into a training and test set.

You should also split the data into a training and test set
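As a rough illustration of the split itself, here is a minimal sketch in plain Python. The 80/20 proportion, the seed, and the helper name `train_test_split` are assumptions made for this example, and the rows are stand-in indices rather than the real observations. On a Spark data frame, `DataFrame.randomSplit` plays a similar role.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 2,111 observations: row indices only, not real data.
rows = list(range(2111))
train, test = train_test_split(rows)
print(len(train), len(test))
```

The test rows are set aside and not touched until the final model comparison, so the reported error reflects data the models have never seen.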

EDA

You should have a narrative that goes through what you are trying to accomplish in the EDA, why you are looking at a particular graph or statistic, and how you interpret what you’ve made. The EDA should be done on the training data only. You should use pandas-on-spark or spark SQL data frames (but matplotlib is fine)

Part of the final’s purpose is to see if you can judge what should and shouldn’t be included in an EDA.

Gender

No real assumption is made here prior to observing the data as we have no reason to believe that either gender would be more likely to face obesity than the other.

Although this is not a huge data set, the results are still hard to read in this format. The visualization below should help us out.

We see that overall our data is very evenly split when it comes to gender. This shouldn't come as a surprise to us. Further analysis should show if there is a correlation between gender and obesity rate.

Immediately we start to see some interesting features of our data. We see the following:

Insufficient Weight: More women than men are of insufficient weight. This could have multiple causes, but one that comes to mind is the pressure on young women to be thin.

Normal Weight: This is evenly split.

Obesity Type I: This is skewed male but not overly so.

Obesity Type II: This is predominantly male. This could be attributed to the way BMI is measured: it uses only weight and height and does not account for the higher-than-average muscle mass that many young men tend to have.

Obesity Type III: This is, surprisingly, almost entirely female. Because this category is based on BMI, it might stand to reason that if a man and a woman weigh the same, the woman would likely have the higher BMI due to either height differences or assumed muscle mass differences.

Overweight Level I and II: These categories seem to be fairly evenly split between genders, with a slight skew toward males.
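Since several of the observations above lean on how BMI is computed, here is a minimal sketch of the calculation. It assumes weight in kilograms and height in meters, and it uses the standard WHO adult cut-offs, which may not match exactly how the data set's own labels were derived. The two people in the example are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared."""
    return weight_kg / height_m ** 2

def who_category(bmi_value):
    """WHO adult BMI classes (an assumption; the data set's labels may differ)."""
    if bmi_value < 18.5:
        return "Underweight"
    if bmi_value < 25:
        return "Normal weight"
    if bmi_value < 30:
        return "Overweight"
    if bmi_value < 35:
        return "Obesity class I"
    if bmi_value < 40:
        return "Obesity class II"
    return "Obesity class III"

# Two hypothetical people with the same weight but different heights.
for height in (1.60, 1.80):
    value = bmi(90, height)
    print(height, round(value, 1), who_category(value))
```

The same 90 kg lands in different categories at the two heights, which is the height effect suggested for Obesity Type III above.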

Smoke

The assumption here is that smoking correlates with being overweight or obese. The idea is that one bad habit could lead to another, and that smokers are less healthy because they smoke and thus might exercise less.

We see that the vast majority of people in our data set do not smoke. For this reason, further analysis of this variable would be hard to visualize without accounting for the nearly 49:1 ratio of non-smokers to smokers.

Age

There are two opposing thoughts here. Younger people might be more fit due to their age and potentially being more active. However, obese people might be less likely to reach an older age, which would skew the data.

We see that our data are right-skewed, with a few records showing people in their 40s, 50s and even 60s. We also see that obesity and overweight might be correlated with age, since the Insufficient_Weight and Normal_Weight groups both sit on the younger side of the distribution compared with the other groups.

Because our data are skewed, the visual above does not give us much insight into how obese people fare later in life. However, the correlation test below will help with this.

Family History

"Wealth begets wealth" is a common saying, meaning that wealth brings forth more wealth: wealthy parents might raise a child who in turn will be wealthy. Family history might likewise tell us a lot about someone's likelihood of being obese. Here we will explore whether "obesity begets obesity".

We see that people with a family history of overweight seem more likely to be obese themselves. This seems especially true for the obesity categories, more so than for the overweight categories.

Transportation Method

From both the bar chart above and the cross table, we see that there seems to be some association between the way people get around and their weight. For example, many people who walk are in the normal weight category. A simple linear model could describe the relationship better, but all we know for now is that further analysis is needed.

FAVC

The assumption here is that frequent consumption of high-caloric food will increase someone's likelihood of being obese.

Given the above visual and table, it is hard to identify a trend between the consumption of high-caloric foods and weight category. It seems that the majority of the individuals in this data set frequently consume high-caloric foods regardless of weight category.

Because most of our observations report frequent consumption of high-caloric food, it is hard to tell how much this variable influences our results. We would need to do some ratio comparisons, but instead we can rely on the correlation below.
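For reference, the kind of ratio comparison mentioned above could look like the sketch below: instead of raw counts, we compare the share of each weight status within each answer group, so a very uneven yes/no split does not hide the pattern. The records, the two-level weight status, and the helper name `row_proportions` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def row_proportions(pairs):
    """Given (group, category) pairs, return {group: {category: share within group}}."""
    counts = Counter(pairs)
    totals = Counter()
    for (group, _), n in counts.items():
        totals[group] += n
    shares = {}
    for (group, category), n in counts.items():
        shares.setdefault(group, {})[category] = n / totals[group]
    return shares

# Hypothetical toy records of (FAVC answer, weight status); not values from the data set.
records = (
    [("yes", "Obese")] * 60 + [("yes", "Not obese")] * 40
    + [("no", "Obese")] * 2 + [("no", "Not obese")] * 8
)
shares = row_proportions(records)
print(shares["yes"]["Obese"], shares["no"]["Obese"])
```

The same approach would apply to the smoking variable, where the two groups are also very different in size.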

Correlation
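One common way to quantify the relationships discussed in the EDA is Pearson's correlation coefficient. The sketch below is a minimal plain-Python version, assuming a numeric predictor and a 0/1 response; the ages and labels are hypothetical and are not taken from the data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ages and a hypothetical 0/1 obesity indicator.
ages = [18, 21, 25, 30, 41]
obese = [0, 0, 1, 0, 1]
print(round(pearson_r(ages, obese), 3))
```

Values near 1 or -1 indicate a strong linear association, and values near 0 indicate little linear association.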

Modeling

Next, you should fit three different classes of models (they can be the ones we did in class or you can branch out). You can have a numeric response or a binary response.

With each model type you use, you should describe the general idea of the model/how it works. These discussions don’t need to be super long, but they should be clear and hit on the most important points about how the model works.

You should use CV to choose among the candidate models for each model type.

• You should set up a pipeline in pyspark for each of your models

• At least one of the pipelines should include at least two transformations prior to the model fit (estimator)

• You can use the same set of transformations for multiple models (if appropriate)
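To make the cross-validation idea concrete, here is a minimal plain-Python sketch of how k folds partition the training rows. The helper name `k_fold_indices` is hypothetical, and the sketch assumes the rows were already shuffled. In pyspark, `CrossValidator` together with `ParamGridBuilder` handles this bookkeeping for a pipeline.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds over n_rows rows."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size

# Each row is used for validation exactly once across the k folds.
print([len(valid) for _, valid in k_fold_indices(10, 3)])
```

For each candidate setting of a tuning parameter, the model is fit on each training part, scored on the matching validation part, and the average score is used to pick the setting.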

We will start our models with a simple linear regression to get a good grasp of what we are looking at. We will also use a LASSO model to pick the best variables for our model, with $\alpha$ as our tuning parameter.

Type I: LASSO Regression

We use a LASSO model to get a general idea of what our variables look like. The idea is that we can fit a linear model but penalize the size of its coefficients, with the penalty strength chosen to fit our data as well as possible. Later models will be more advanced/complex and hopefully will fit our data better.

We see a somewhat small alpha value, meaning that our LASSO regression won't be too different from our linear regression. We hope that this helps our model generalize so that we get the best results possible from a linear model. Next we will use other techniques, which we suspect will fit our data better.

With a low value of alpha, our LASSO model penalizes the linear regression only slightly, meaning the LASSO fit is very close to a simple linear regression.

We see that we start with a low RMSE to begin with. This doesn't tell us much on its own, since RMSE is used to compare models and currently we only have one. Let's see if we can predict our data better with more complex/advanced models.
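For reference, one common way to write the LASSO objective is shown below. The $\frac{1}{2n}$ scaling is an assumption (it follows a widely used convention, and other software scales the loss differently), with $n$ observations, $p$ predictors $x_{ij}$, and response $y_i$:

$$
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \alpha \sum_{j=1}^{p} |\beta_j|
$$

When $\alpha = 0$ the penalty term vanishes and the solution is the ordinary least squares fit, which is why a small $\alpha$ gives results close to a simple linear regression. Larger values of $\alpha$ shrink more coefficients exactly to zero, which is how LASSO selects variables.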

The $R^2$ here tells us that our model has a "decent" linear association with our Y variable, meaning we are doing "OK" at predicting whether someone will be obese or not.
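As a reminder of what these two metrics measure, here is a minimal plain-Python sketch of RMSE and $R^2$. The responses and predictions are hypothetical values chosen for illustration, not output from our model.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """Share of the response's variation explained, relative to predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical 0/1 responses and hypothetical model predictions.
y_true = [0, 1, 1, 0]
y_pred = [0.1, 0.8, 0.6, 0.3]
print(round(rmse(y_true, y_pred), 4), round(r_squared(y_true, y_pred), 4))
```

RMSE is on the scale of the response and is mainly useful for comparing models, while $R^2$ compares the model against always predicting the mean.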

Type II: Logistic Regression

Logistic regression models the probability of success (the average number of successes for given x variables). The fitted function never goes below 0 or above 1. This is useful since we have a binary response where we can say success is 1 and failure is 0.

The interpretation of the betas is a little different from that of linear regression. In logistic regression, each beta represents the change in the log-odds of the response for a one-unit change in the respective x.
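The log-odds interpretation can be checked numerically. The sketch below uses a hypothetical intercept and a single hypothetical coefficient (not fitted values from our model) and shows that raising x by one unit changes the log-odds by exactly that coefficient.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def predict_probability(intercept, coefs, x):
    """Probability of success for one observation under a logistic regression."""
    return sigmoid(intercept + sum(b * xi for b, xi in zip(coefs, x)))

def log_odds(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

# Hypothetical intercept and coefficient, chosen only for illustration.
intercept, coefs = -1.0, [0.5]
p_before = predict_probability(intercept, coefs, [2.0])
p_after = predict_probability(intercept, coefs, [3.0])
print(round(p_before, 4), round(p_after, 4))
print(round(log_odds(p_after) - log_odds(p_before), 4))  # equals the coefficient
```

The change in probability depends on where x starts, but the change in log-odds is the same everywhere, which is why the betas are read on the log-odds scale.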

Type III: KNN Model

A KNN model will help us predict whether someone is obese based on all the variables in the data set. It does this by finding the k training observations closest to a new observation, measured by Euclidean distance, and assigning the most common class among those neighbors.
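A minimal plain-Python sketch of this idea follows. The training points (height in meters, weight in kilograms), the 0/1 labels, and the choice k = 3 are hypothetical and serve only to illustrate the mechanics.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (height, weight) points with a hypothetical 0/1 obesity label.
train_X = [(1.60, 50), (1.65, 55), (1.70, 95), (1.75, 100), (1.80, 110)]
train_y = [0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (1.72, 98)))
print(knn_predict(train_X, train_y, (1.62, 52)))
```

In this toy example weight dominates the distance because it is on a much larger scale than height, so in practice the features would usually be standardized before fitting a KNN model.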